@samxbr
Contributor

@samxbr samxbr commented Nov 4, 2025

Adds a new metric es.reindex.completion.total to track the number of completed reindex operations, along with these metric attributes to identify different results:

  • error.type: if present, indicates the reindex failed with the specified exception. Otherwise, indicates the reindex was successful
  • reindex.source: local or remote, indicates whether the source cluster was the local or a remote cluster
    • this attribute is also added to the existing es.reindex.duration.histogram metric
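
As a rough illustration of how these two attributes combine, here is a minimal Java sketch. The class `ReindexCompletionAttributes` and its `build` method are hypothetical names, not code from this PR, and using the exception's simple class name as the `error.type` value is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not the code from this PR: builds the attribute map
// described above for one completed reindex operation.
final class ReindexCompletionAttributes {
    static final String ERROR_TYPE = "error.type";
    static final String SOURCE = "reindex.source";

    private ReindexCompletionAttributes() {}

    /**
     * @param remoteSource true if the source cluster was remote, false if local
     * @param failure      the exception that failed the reindex, or null on success
     */
    static Map<String, Object> build(boolean remoteSource, Exception failure) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put(SOURCE, remoteSource ? "remote" : "local");
        if (failure != null) {
            // Assumption: the exception's simple class name is the attribute value.
            attributes.put(ERROR_TYPE, failure.getClass().getSimpleName());
        }
        // No error.type key means the reindex completed successfully.
        return attributes;
    }
}
```

With this shape, a dashboard could count successes by filtering on the absence of `error.type` and split either outcome by `reindex.source`.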

@samxbr samxbr added the :Data Management/Indices APIs and >non-issue labels Nov 5, 2025
@samxbr samxbr marked this pull request as ready for review November 5, 2025 04:28
@elasticsearchmachine elasticsearchmachine added the Team:Data Management label Nov 5, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Member

@PeteGillinElastic PeteGillinElastic left a comment

Thanks @samxbr. I haven't done a full review on this, since you said you were looking for early feedback, but here are some initial thoughts.

We also talked about attempting to get lost operations (due to node restart). Did you look at how the existing chart for that works? Is it grepping the logs? Do you know where that logging is done? I'm wondering whether we want to try to figure out how to distinguish remote vs local in there, too.

@samxbr
Contributor Author

samxbr commented Nov 6, 2025

Good question. I assume you are referring to the "Reindexing tasks failures logged during shutdown" graph. Judging from the search query, I think it is searching for this log:

        {
          "match_phrase": {
            "log.logger": "org.elasticsearch.node.ShutdownPrepareService"
          }
        },
        {
          "match_phrase": {
            "log.level": "WARN"
          }
        },
        {
          "match_phrase": {
            "message": "*reindex*"
          }
        }

We probably need to implement something else to capture remote/local there. I can take a deeper look at that later.

@PeteGillinElastic
Member

Yes, that's what I was referring to. Thanks.

@PeteGillinElastic
Member

Looking at the code, I don't think it's going to be straightforward to get the remote/local info in there. The code you linked is generic task-related code, and we are only identifying reindex tasks via a regex on the task name. I assume that changing the task name so that it included remote would be somewhat involved, and might be risky if other stuff is dependent on the naming convention (not sure whether it is).
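
To make the limitation concrete, here is a minimal Java sketch of name-based filtering. The class `ReindexTaskFilter`, the pattern, and the sample task names are all hypothetical; the actual regex is not shown in this thread.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch, not the Elasticsearch shutdown code: selects
// reindex tasks purely by matching a regex against the task name.
final class ReindexTaskFilter {
    // Assumed pattern for illustration only.
    static final Pattern REINDEX_TASK = Pattern.compile(".*reindex.*");

    private ReindexTaskFilter() {}

    static List<String> reindexTasks(List<String> taskNames) {
        return taskNames.stream()
            .filter(name -> REINDEX_TASK.matcher(name).matches())
            .collect(Collectors.toList());
    }
}
```

Because the only input is the task name, a filter like this cannot tell a local reindex from a remote one unless the name itself encodes that, which is the change described above as involved and possibly risky.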

Member

@PeteGillinElastic PeteGillinElastic left a comment

Thanks Sam! Just one substantive comment (and one nit).

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=afd3a426eabdfda7d4fd6b0c52d76162e3c9c47e
HEAD=26abb9d1597bc46b560996f1854ea01e858f061f
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 8, 2025
BASE=afd3a426eabdfda7d4fd6b0c52d76162e3c9c47e
HEAD=26abb9d1597bc46b560996f1854ea01e858f061f
Branch=main
Labels

:Data Management/Indices APIs, >non-issue, Team:Data Management, v9.3.0

3 participants